ABSTRACT

Methods of probability modeling to analyze rater
agreement are described, emphasizing their basic similarities and
viewing them as variants of a common methodology. Statistical
techniques for analyzing agreement data are described to address
questions such as how many opinions are required to make a medical
diagnosis with necessary accuracy. Kappa and other agreement indices,
variance components approaches, and latent structure models are
considered. Focus is on two related techniques, which differ in
assumptions about disease subtypes and associated differences among
cases in their ability to be correctly diagnosed: (1) latent class
agreement analysis; and (2) latent trait agreement analysis.
Specifically, these methods make it possible to determine from the
opinion of panels of diagnosticians in an agreement study the
following: the probable accuracy of an individual diagnosis; the
probability of disease presence or absence given unanimous or
conflicting opinions by several diagnosticians; and how many opinions
should be required to make the diagnosis. It is concluded that
because the estimation procedures and software are better developed 
for the latent class agreement model, investigators should pursue 
this approach first. Eleven data tables and a 54-item list of 
references are included. (RLC) 
PREFACE

The probability modeling approach to analyzing rater agreement has
emerged in the literature in a somewhat disjointed manner, with
different models being proposed by various authors, complicating the 
task of the researcher who wishes to acquire familiarity with these 
methods and apply them in his or her research. The goal of this Note is 
to describe these approaches, emphasizing their basic similarities and 
viewing them as variants of a common methodology. 

This Note should be of use both to the applied researcher who is 
interested in analyzing rater agreement data, and the technically 
oriented reader concerned with methods for analyzing agreement. 
Accordingly, not all sections are intended for all readers--the former 
group may find some sections to contain more detail than they require, 
and the latter may find some material redundant. Readers more concerned 
with substantive applications may want to concentrate on the 
introductory portions of each section and the computational examples. 

Although an attempt has been made to be as comprehensive as 
possible in surveying previous work in this area, no doubt there are 
important contributions to this literature that have inadvertently been 
overlooked. 
SUMMARY



How do we know how many opinions are required to make a diagnosis 
with necessary accuracy? One way is by examining how often physicians 
agree on the diagnosis. This Note discusses statistical techniques that 
can be used to analyze agreement data to address this and related 
questions. Specifically, these methods make it possible to determine 
from the opinions of panels of diagnosticians in an agreement study the
following: (1) the probable accuracy of an individual diagnosis; (2) the
probability of disease presence or absence given unanimous or
conflicting opinions by several diagnosticians; and (3) how many
opinions should be required to make the diagnosis. The methods
discussed include two related techniques, which differ in assumptions
about disease subtypes and associated differences among cases in their
ability to be correctly diagnosed. These techniques have many 
applications in addition to that of medical diagnosis. 
I. INTRODUCTION



Powerful methods for measuring agreement on diagnosis and related 
forms of classification now exist. The origin of these methods can be 
traced as far back as Poisson's studies of juror agreement [1], and they 
are closely related to the well known statistical techniques of latent
class and latent trait analysis [2-4]. That they are more
computation-intensive than traditional approaches to measuring agreement
has probably been a factor in their not yet having received widespread use.
However, because of advances in microcomputer hardware and software, 
they are now well within reach of most researchers, and offer 
considerable promise for leading to better use of agreement data than 
has previously been possible. 

To fully appreciate the usefulness of these methods and their
advantages relative to other ways of measuring agreement, it is helpful
to consider them in light of a hypothetical example.* Suppose that a
patient is diagnosed as having a rare disease. Immediately there are 
several questions that come to mind, foremost among them being: 

• How likely is it that this diagnosis is correct? 

Suppose, though, that this question cannot be answered directly, 
since there is no definitive test for the disease. The questions then 
asked might be as follows: 



*The hypothetical example, as well as much of our discussion, 
focuses on medical diagnosis as an instance of expert rating. However, 
it is understood that what is said applies equally well to other types 
of dichotomous classifications, e.g., the designation of a defendant as 
guilty or not guilty by jurors or the categorization of parts as 
operative or nonoperative by inspectors. 
• To what extent do diagnoses by this diagnostician tend to
reflect the judgment of other or most diagnosticians?

and

• Given that a single diagnosis is subject to error, how
worthwhile would it be to obtain additional opinions, for
example, a diagnosis by a second or even third independent
source?

The latter question, in turn, gives rise to another:

• Given opinions by several diagnosticians, which may include
both positive and negative diagnoses, what is the probability
of having or not having the disorder?



The practical nature of these questions hardly needs to be 
emphasized. Data are routinely collected by means of inter-rater 
agreement studies for the specific purpose of answering them. In its 
basic form, such a study consists of a sample of N cases, each evaluated 
by two or more diagnosticians. The subset of such studies we are mainly 
concerned with here are those where (1) the number of opinions remains 
constant across cases, (2) diagnosticians formulate their opinions 
independently of one another, and (3) evaluations take the form of 
dichotomous ratings, for example, "disorder present" and "disorder 
absent." More complex models, such as those involving multiple or 
graded response categories, may be derived from this simplified model. 

As an example of such a study, consider the data in Table 1.1. 
These data, originally presented by Yerushalmy [5], concern ratings of
radiographic films as either indicative or not indicative of
tuberculosis by eight physicians each. As shown, the majority of cases
received eight negative ratings, with a smaller number receiving eight
positive ratings. However, disagreement is also indicated by the cases 
receiving various combinations of positive and negative diagnoses. 






Table 1.1

EXAMPLE DIAGNOSTIC AGREEMENT DATA

Number of Positive    Observed
Diagnoses (j)         Frequency
------------------    ---------
        0               13560
        1                 877
        2                 168
        3                  66
        4                  42
        5                  28
        6                  23
        7                  39
        8                  64

SOURCE: From Ref. 5.

NOTE: Each case diagnosed eight times. Total number of
cases is 14,867.



Surprisingly, although studies such as this which collect multiple- 
rater agreement data are common, particularly in medical research [6-8], 
traditional methods for analyzing this information are not well suited 
to addressing the above-posed questions. Our purpose here is to 
describe and illustrate a statistical approach that makes it possible to 
answer these questions much more directly and precisely. Much of what 
we present is not new--rather, elements are to be found scattered 
throughout a diverse literature in statistics, medicine, psychology, 
sociology, and education. We attempt here to weave these elements into 
a coherent set of techniques that may be properly viewed as a 
methodology, rather than simply a set of methods.
ORGANIZATION OF THIS NOTE



In the remainder of this section, we review previous approaches for
the measurement of agreement. Following this, we explain the basic
rationale of the approach considered here, and present a taxonomy of
specific techniques subsumed under the general model. In the next two
sections, we present the two main variants of this approach.

METHODS FOR MEASURING AGREEMENT

Kappa and Other Agreement Indices

Several methods have previously been proposed for measuring
agreement. The most common approach is the calculation of an agreement
index. The most elementary such index consists of the proportion of
times two ratings of the same case agree. Other agreement indices
include Yule's Y [9], the odds ratio [10],* and the phi coefficient.

A widely known class of agreement indices is obtained by dividing
the difference between observed pairwise agreement and the level
expected by chance by 1 minus the level expected by chance [12].
Various indices of this class differ in how the expected proportion of
chance agreement is calculated. Foremost among these indices is the
kappa coefficient [13, 14], which estimates chance agreement based on
the product of the marginal proportions of positive and negative
ratings. Although kappa has been widely used, and although its
usefulness for verifying that observed levels of agreement exceed chance
levels is clearly established, concern has been expressed about its
potential limitations. Several authors have discussed what has been
termed the base rate problem [15-17], whereby a rating procedure with a
high level of accuracy may yield low levels of agreement as measured by
the kappa coefficient in samples where the proportions of positive and
negative ratings (i.e., the base rates) are close to 1 and 0.**

*Darroch and McCloud [11] also develop a more extensive methodology
for analyzing rater agreement based on the odds ratio.

**Shrout, Spitzer, and Fleiss [18], however, contended that this is
in fact a desirable property.

A more fundamental limitation of agreement indices in general is
that they summarize all of the information on agreement and disagreement
in a single term. Thus, agreement on positive ratings and agreement on
negative ratings are subsumed under one index, which results in loss of
useful information. Further, because of the lack of an explicit 
probability model underlying their calculation, such measures do not 
readily permit agreement data to be used to answer the types of 
questions posed above. 
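
As a rough illustration of the chance-corrected index described above, and of the base rate problem, the following Python sketch computes kappa for two raters from a 2 x 2 table of dichotomous ratings. The function name `cohen_kappa` and all cell counts are hypothetical and chosen only for illustration; they are not data from this Note.

```python
def cohen_kappa(both_pos, pos_neg, neg_pos, both_neg):
    """Kappa for two raters making dichotomous ratings.

    both_pos: cases both raters call positive
    pos_neg:  rater 1 positive, rater 2 negative
    neg_pos:  rater 1 negative, rater 2 positive
    both_neg: cases both raters call negative
    """
    n = both_pos + pos_neg + neg_pos + both_neg
    p_observed = (both_pos + both_neg) / n
    # Chance agreement from the product of the marginal proportions.
    r1_pos = (both_pos + pos_neg) / n
    r2_pos = (both_pos + neg_pos) / n
    p_chance = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical tables with the same raw agreement (0.90) but different
# base rates of positive ratings.
balanced = cohen_kappa(45, 5, 5, 45)   # about half the ratings positive
skewed = cohen_kappa(2, 5, 5, 88)      # positive ratings are rare
```

With these made-up counts the raw proportion of agreement is identical in both tables, yet kappa is much lower in the skewed table, which is the pattern the base rate discussion refers to.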

Variance Components Approaches 

An alternative is to express agreement on dichotomous 
classifications in a manner analogous to the intraclass correlation used
to assess the reliability of interval or ratio scale measures [19-21].
By this approach, positive and negative ratings are coded 1 and 0, and
the proportion of total variation among ratings that is attributable to
between-case variability (i.e., not attributable to variation in ratings
of the same case) is calculated as a measure of classification precision
or reliability. Limitations characteristic of intraclass correlation
approaches to expressing rating reliability in general, however, apply
here as well. Specifically, raters with a given degree of consistency
in the absolute sense of tending to agree or disagree on ratings of the
same type of case will yield higher or lower intraclass correlations,
depending upon the level of between-case variation, which is a function
of the prevalence of positive and negative cases in the sample. This is
directly analogous to the base rate problem of the kappa coefficient,
just as the kappa coefficient itself is closely related to the
intraclass correlation coefficient. Again, however, the more important
limitation of variance-partitioning approaches is that they do not 
express agreement in a way that lends itself to answering the questions 
posed above. 
Latent Structure Models

The methods we consider here fall under the general heading of what
may be termed Latent Structure Agreement Analysis. These methods
approach the problem of quantifying agreement from a much different
perspective than agreement indices or variance-partitioning methods.
Specifically, they develop a parameterized model, which entails an
explicit characterization of the relationship between individual rater
accuracy and inter-rater agreement. In essence, these methods may be
understood as attempting to answer the question: What level of rater
accuracy would be required to generate a pattern of agreements and
disagreements such as that observed? This approach views the accuracy
of raters and the prevalence of various types of cases as unobserved
parameters, and estimates these parameters based on observed data. Once
derived, these estimates can be applied in Bayesian calculations to
provide answers to the kinds of questions posed above. 

Four main variants of this approach have thus far been suggested in 
the literature. Agreement data may be collected using a research design 
where multiple opinions for each case come from either the same or
differing sets of raters. We refer to these as fixed and varying rating 
panel designs, respectively. 

In addition, a disorder may be viewed as discrete or continuous.
By the former view, cases are seen as belonging to one of a relatively 
small set of categories or types, each of which corresponds to a certain 
trait level and has an associated probability of eliciting a positive 
rating. By the continuous view, cases are assumed to have trait levels 
and corresponding probabilities of eliciting positive ratings that may 
fall anywhere on a continuum. From the former assumption comes an 
approach to analyzing agreement that may be recognized as a special case 
of latent class analysis [2-4]; accordingly, we term the models in this
category Latent Class Agreement Analysis. Methods based on the
assumption of a continuously varying trait, in turn, may be seen as a 
special case of latent trait analysis [2, 22], and are therefore termed 
Latent Trait Agreement Analysis. 
Either latent structure model may be combined with either rating
design, leading to four main variants of the Latent Structure Agreement
Analysis approach. These are termed the varying panel/latent class,
fixed panel/latent class, varying panel/latent trait, and fixed
panel/latent trait models, respectively.
II. LATENT CLASS AGREEMENT ANALYSIS



VARYING RATING PANEL 

This approach corresponds to discussions of rater agreement by 
Gelfand and Solomon [1, 23], Kaye [24], Kraemer [25], and Uebersax
[26].*

We first describe the basic approach and discuss the estimation and
comparison of models, and then discuss the application of derived 
parameters to the estimation of rating accuracy and the interpretation 
of multiple opinions. Following this, we consider a computational 
example. 

Model

Let N cases each be evaluated by randomly selected groups of k
raters, and let each rater's evaluation take the form of a dichotomous
rating, e.g., a positive or negative diagnosis. Recall (as illustrated
in Table 1.1) that out of k ratings for each case, any number j (j = 0,
1, ..., k) may be positive. Considering outcomes across all cases, the
frequencies of cases with each possible number of positive ratings may
be obtained, denoted $f_0, f_1, \ldots, f_k$. In accordance with the
assumptions of latent class analysis, these frequencies are assumed to
be determined by two sets of parameters: the prevalences of c mutually
exclusive and exhaustive latent classes to which cases belong (latent
class prevalences) and the conditional probabilities of a positive
rating, given a case belonging to each (conditional rating
probabilities). Each latent class is assumed to correspond to a type of
case with a specific probability of eliciting a positive rating. Thus,
latent classes may represent genetically different subtypes of a
disorder, or functional groupings of cases based on levels of symptom

*Stewart and Rey [27] and Fleiss and Shrout [28] also discuss
methods that are similar, but require that estimates of the prevalence
of positive and negative cases be available.
intensity or salience, for example, categories of "not symptomatic,"
"moderately symptomatic," and "highly symptomatic."

The prevalences of each latent class are denoted by $\pi_1, \pi_2,
\ldots, \pi_c$, where $\pi_s$ ($s = 1, 2, \ldots, c$) is the probability
of a randomly sampled case belonging to latent class s.* For
convenience, we indicate a positive rating by 1 and a negative rating
by 0. We denote the conditional rating probabilities by $\pi_{1|1},
\pi_{1|2}, \ldots, \pi_{1|c}$, where $\pi_{1|s}$ is the probability of a
positive rating given a member of latent class s.** By convention, we
number latent classes in order of increasing probability of eliciting a
positive rating, i.e., such that

$$\pi_{1|1} < \pi_{1|2} < \cdots < \pi_{1|c}.$$

The expected number of cases receiving exactly j out of k positive
ratings, denoted $e_j$, is equal to

$$e_j = N \sum_{s=1}^{c} \pi_s \binom{k}{j} \pi_{1|s}^{\,j} (1 - \pi_{1|s})^{k-j}. \qquad (1)$$

This leads to the set of expected frequencies of cases with various
numbers of positive ratings, $e_0, e_1, \ldots, e_k$. The goal, then, is
to obtain estimates of latent class prevalences and conditional rating
probabilities, i.e., $\pi_s$ and $\pi_{1|s}$ parameters, that maximize
the correspondence between observed and expected frequencies of cases
with various numbers of positive ratings, that is, the $f_j$ and $e_j$
terms.

*A special case of these models occurs when it is assumed that
there is only one class to which cases belong. This situation, which we
shall be concerned with primarily in conjunction with the evaluation and
comparison of latent class models, is related to the log-linear models
for agreement analysis described by Tanner and Young [29].

**In the general case, this model requires that raters be randomly
sampled for each case. Because each rating is thus a random sampling,
the probability of a positive rating for a given case remains constant,
even though raters themselves may differ in their tendency to make
positive or negative ratings (in the fixed panel design discussed below,
allowances are made for rater differences). The varying panel model,
however, is also applicable under other circumstances, for example, if
the same test is repeated on multiple occasions, or the same set of
raters evaluate each case, but their probabilities of making positive
ratings, conditional on latent class, are the same. The essential
feature of the varying panel model is that, for a given case, the value
of $\pi_{1|s}$ is always the same.
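
The expected frequencies can be sketched in a few lines of Python, assuming they take the binomial-mixture form $e_j = N \sum_s \pi_s \binom{k}{j} \pi_{1|s}^{j} (1 - \pi_{1|s})^{k-j}$ implied by the definitions above. The function name `expected_frequencies` and the two-class parameter values are hypothetical; they are not estimates reported in this Note.

```python
from math import comb


def expected_frequencies(n_cases, k, prevalences, pos_probs):
    """Expected number of cases with j = 0..k positive ratings.

    prevalences[s] plays the role of pi_s, pos_probs[s] of pi_{1|s}.
    """
    expected = []
    for j in range(k + 1):
        # Probability of exactly j positives, mixed over latent classes.
        prob_j = sum(
            pi_s * comb(k, j) * p ** j * (1 - p) ** (k - j)
            for pi_s, p in zip(prevalences, pos_probs)
        )
        expected.append(n_cases * prob_j)
    return expected


# Hypothetical two-class model: 1000 cases, 3 ratings per case.
e = expected_frequencies(1000, 3, [0.9, 0.1], [0.05, 0.8])
```

Because the class prevalences sum to 1 and each class contributes a full binomial distribution, the expected frequencies sum to the number of cases.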

Estimation 

Standard numerical estimation procedures can be used to find the
parameter values that maximize this correspondence. Uebersax [26]
described a procedure for obtaining maximum likelihood estimates based
on the Newton-Raphson method, following the general approach of
Lazarsfeld and Henry [2]. More recently, we have used an EM algorithm
[30] related to that described by Goodman [3] and Dawid and Skene [31].*
This algorithm is more flexible than the Newton-Raphson method, but
tends to converge more slowly. A compromise is to initially apply the
EM algorithm to obtain good approximations to maximum likelihood
parameter estimates, and then to use these as starting values for the
Newton-Raphson procedure, which converges more rapidly on final
estimates. Approximate standard errors of parameter estimates are
obtained by the standard method of inverting the information matrix [2].
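
To make the EM idea concrete, here is a minimal sketch applied to the Table 1.1 frequencies. It is a generic EM for a mixture of binomials, written only for illustration; it is not the PANEL program or the exact algorithm of the cited references, and the function name, starting values, and iteration count are assumptions.

```python
from math import comb, log


def em_latent_class(freqs, pi, p, n_iter=200):
    """EM sketch for the varying panel model.

    freqs[j] is the number of cases with j positive ratings out of k;
    pi and p are starting values for prevalences and positive-rating
    probabilities. Returns final pi, p, and the log-likelihood trace.
    """
    k = len(freqs) - 1
    n = sum(freqs)
    c = len(pi)
    trace = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each count j.
        post, loglik = [], 0.0
        for j in range(k + 1):
            joint = [pi[s] * comb(k, j) * p[s] ** j * (1 - p[s]) ** (k - j)
                     for s in range(c)]
            total = sum(joint)
            post.append([x / total for x in joint])
            loglik += freqs[j] * log(total)
        trace.append(loglik)
        # M-step: reweighted prevalences and positive-rating rates.
        weight = [sum(freqs[j] * post[j][s] for j in range(k + 1))
                  for s in range(c)]
        positives = [sum(freqs[j] * post[j][s] * j for j in range(k + 1))
                     for s in range(c)]
        pi = [w / n for w in weight]
        # Keep probabilities strictly inside (0, 1) for numerical safety.
        p = [min(max(pos / (k * w), 1e-9), 1 - 1e-9)
             for pos, w in zip(positives, weight)]
    return pi, p, trace


table_1_1 = [13560, 877, 168, 66, 42, 28, 23, 39, 64]
pi_hat, p_hat, trace = em_latent_class(table_1_1, [0.9, 0.1], [0.05, 0.7])
```

The log-likelihood trace should be non-decreasing across iterations, which is a convenient check on any EM implementation. In practice, the resulting values could serve as starting values for a faster Newton-type routine, as the text suggests.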

Identifiability 

Lazarsfeld and Henry [2], Goodman [3], and others discuss
identifiability of latent class models. An unidentifiable model is
analogous to a set of equations where there are more variables than
independent equations, permitting an infinite number of solutions. A
necessary condition for model identifiability is that the number of
parameters requiring estimation not exceed the number of degrees of
freedom for the observed data. Given k ratings per case, there are k +
1 possible numbers of positive ratings, but only k degrees of freedom,
since $f_0 + f_1 + \ldots + f_k = N$. Thus, for k ratings per case,
there can be no more than k parameters requiring estimation. The
parameters requiring estimation are c - 1 of the $\pi_s$ terms (one need
not be estimated, since they must sum to 1) and the c $\pi_{1|s}$ terms.
Thus, for a model with c latent classes to be identifiable, it is
necessary that $k \ge 2c - 1$, unless constraints are imposed on
parameters. Table 2.1 shows, for varying panel models with various
numbers of latent classes, the minimum number of ratings per case
necessary for a model to be identifiable. It may be noted that for a
model with two latent classes, at least three ratings per case are
required to estimate model parameters (although four ratings would also
permit a test of model fit). Further, it has been our experience (and
also noted by Kraemer [25]) that two-class models are often not
sufficient to characterize the complexity of a rating process. We have
typically found models with three or four classes more suitable.

*A description of this algorithm as applied to the varying panel
model is presented in [32].

Satisfying the condition above is usually a necessary but not a
sufficient condition for latent class model identifiability. The
varying panel agreement model with dichotomous ratings, however, is a
relatively simple application of the general latent class model.
Experience thus far suggests that for this class of models the necessary
condition above is also a sufficient condition, except in certain
trivial cases, e.g., when all cases are unanimously rated positive or
negative, or when data that are fit perfectly by a model with a smaller
number of latent classes are analyzed using a model with a larger number
of classes.

Table 2.1

MINIMUM RATINGS PER CASE REQUIRED FOR MODEL IDENTIFIABILITY
AND ASSESSMENT OF FIT: VARYING PANEL DESIGN

                    Ratings per Case      Ratings per Case
Number of           Required for Model    Required for
Latent Classes      Identifiability       Chi-Square Test
--------------      ------------------    ----------------
      1                     1                     2
      2                     3                     4
      3                     5                     6
      4                     7                     8

NOTE: Assumes unconstrained model.
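
The parameter-counting argument behind Table 2.1 can be checked with a short Python sketch. The function names are hypothetical, and the counts assume an unconstrained varying panel model with c - 1 free prevalences and c conditional rating probabilities.

```python
def n_free_parameters(c):
    # c - 1 prevalences (they sum to 1) plus c positive-rating probabilities.
    return (c - 1) + c


def min_ratings_identifiable(c):
    # Necessary condition: k must be at least the number of free parameters.
    return n_free_parameters(c)


def min_ratings_for_fit_test(c):
    # One more rating leaves at least one degree of freedom for a fit test.
    return n_free_parameters(c) + 1


rows = [(c, min_ratings_identifiable(c), min_ratings_for_fit_test(c))
        for c in range(1, 5)]
```

Each tuple in `rows` holds the number of latent classes followed by the two minimum ratings-per-case values for that model.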

Constraints on Model Parameters 

It is often possible to estimate a model not otherwise identifiable
by imposing constraints on model parameters. For varying panel designs,
the most common constraint involves setting one or more parameters to
specified values. For example, one may set $\pi_{1|s}$ to 0 or 1 for a
particular latent class. For examples of parameter constraints in the
estimation of latent class agreement models, see Refs. 33 and 34.

Model Fit and Comparison 

The fit of a latent class agreement model may be assessed by
comparing observed results to what would be expected by the model, using
either a Pearson or likelihood ratio chi-square statistic [3]. The
Pearson chi-square is calculated by the formula
$X^2 = \sum_j (f_j - e_j)^2 / e_j$, and the likelihood ratio chi-square
by the formula $L^2 = 2 \sum_j f_j \log(f_j / e_j)$, where the values
for $e_j$ are calculated using estimates of $\pi_s$ and $\pi_{1|s}$
parameters. The degrees of freedom associated with each is k minus
the number of estimated parameters. For unconstrained models, this is
equal to k - 2c + 1. Model fit is indicated by a low value relative to
the degrees of freedom, i.e., a nonsignificant value. Statistical
significance may be determined from standard tables of the $\chi^2$
statistic.

An advantage of the likelihood ratio chi-square is that it permits 
comparison of alternative models of the same data. The statistical 
significance of the difference between two models is evaluated by 
subtracting their corresponding likelihood ratio chi-squares. The 
degrees of freedom for the resulting difference statistic is equal to 
the difference in the degrees of freedom for the individual chi-squares. 
This requires that the models compared be nested, i.e., that the 
parameters of one be a subset of those of the other. This is always the 
case for models that differ only in the number of latent classes. 
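
A minimal Python sketch of the two fit statistics and the degrees-of-freedom count follows, assuming the usual forms $X^2 = \sum_j (f_j - e_j)^2 / e_j$ and $L^2 = 2 \sum_j f_j \log(f_j / e_j)$. The function names and the small observed and expected vectors are hypothetical and serve only to show the arithmetic.

```python
from math import log


def pearson_chi_square(observed, expected):
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))


def likelihood_ratio_chi_square(observed, expected):
    # Cells with f = 0 contribute nothing (the limit of f*log(f/e) is 0).
    return 2 * sum(f * log(f / e)
                   for f, e in zip(observed, expected) if f > 0)


def degrees_of_freedom(k, c):
    # k degrees of freedom in the data minus 2c - 1 estimated parameters
    # (unconstrained varying panel model).
    return k - (2 * c - 1)


# Hypothetical observed and model-expected frequencies.
obs = [50, 30, 20]
exp = [45, 35, 20]
x2 = pearson_chi_square(obs, exp)
l2 = likelihood_ratio_chi_square(obs, exp)

# Nested comparison: the difference in L2 between two models is referred
# to a chi-square with the difference in their degrees of freedom.
df_difference = degrees_of_freedom(8, 2) - degrees_of_freedom(8, 3)
```

The last line shows how the degrees of freedom for a difference test between a two-class and a three-class model with eight ratings per case would be counted.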
In assessing model fit it is important to take sample size into
account. Given a sufficiently large sample, even a small difference
between observed and expected frequencies will likely result in
significant chi-square values. Thus, it may also be useful to assess
fit in terms of statistics such as the normed fit index [35], which are
less sample size dependent. Clogg [36] recommends an equivalent index,
calculated as $(L_1^2 - L_c^2)/L_1^2$, where $L_c^2$ is the likelihood
ratio chi-square for a given latent class model, and $L_1^2$ is the
corresponding statistic obtained using a one-class (independence) model.
This is analogous to the proportion of variance unexplained by the
one-class model that is explained by the multiple-class model.

Having described the parameters of the varying rating panel latent 
class agreement model and discussed methods by which parameters are 
estimated and models evaluated, we now proceed to the subject of how 
these estimates can be used to address the questions concerning the 
accuracy and interpretation of ratings initially posed. 

Estimation of Rating Accuracy 

In the definition of latent classes we stated that each corresponds
to a subset of cases with similar trait levels and probabilities of
eliciting a positive rating. If each latent class can be interpreted as
a variety of positive or negative case, model parameter estimates may be
used to directly estimate rating accuracy.*

*Alternatively, each latent class may be viewed as a specific
mixture of positive and negative cases. In this situation, slightly
more complex formulas than those presented here are required, but, as
discussed in Uebersax [26], it is generally possible to derive at least
upper bound estimates for rating accuracy using these methods.

The accuracy of dichotomous ratings is commonly expressed in terms
of four indices: sensitivity (Se), specificity (Sp), positive
predictive validity (PV+), and negative predictive validity (PV-).
Rating sensitivity is defined as the probability of a positive rating
given a positive case. Rating specificity is the probability of a
negative rating given a negative case. Positive and negative predictive
validity are defined as the reverse conditional probabilities of
sensitivity and specificity. That is, positive predictive validity is
the probability of a positive case given a positive rating and negative
predictive validity is the probability of a negative case given a
negative rating. By denoting a positive and negative case + and - and
a positive and negative rating '+' and '-', we may define Se =
Pr['+'|+], Sp = Pr['-'|-], PV+ = Pr[+|'+'], and PV- = Pr[-|'-'].

For a given model, let the numbers a and b be such that latent
classes 1, 2, ..., a are subtypes of negative cases, and latent classes
b, b + 1, ..., c are subtypes of positive cases. Se, Sp, PV+, and PV-
are then obtained as follows:

$$Se = \frac{\sum_{s=b}^{c} \pi_s \pi_{1|s}}{\sum_{s=b}^{c} \pi_s}, \qquad (2)$$

$$Sp = \frac{\sum_{s=1}^{a} \pi_s (1 - \pi_{1|s})}{\sum_{s=1}^{a} \pi_s}, \qquad (3)$$

$$PV+ = \frac{\sum_{s=b}^{c} \pi_s \pi_{1|s}}{\sum_{s=1}^{c} \pi_s \pi_{1|s}}, \qquad (4)$$

and

$$PV- = \frac{\sum_{s=1}^{a} \pi_s (1 - \pi_{1|s})}{\sum_{s=1}^{c} \pi_s (1 - \pi_{1|s})}. \qquad (5)$$

Also of interest is the false-negative error rate, or the
probability of a negative rating given a positive case, equal to 1 -
Se, and the false-positive error rate, or the probability of a positive
rating given a negative case, equal to 1 - Sp.
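
The four accuracy indices can be computed directly from a set of latent class parameters, as in the following Python sketch. It assumes that every latent class is either a negative or a positive subtype, with the first `n_negative` classes negative; the function name and the three-class parameter values are hypothetical.

```python
def accuracy_indices(prevalences, pos_probs, n_negative):
    """Se, Sp, PV+ and PV- from latent class parameters.

    Classes are ordered by increasing positive-rating probability; the
    first n_negative classes are treated as negative subtypes and the
    rest as positive subtypes (an assumption of this sketch).
    """
    c = len(prevalences)
    neg = range(n_negative)
    pos = range(n_negative, c)
    hit = sum(prevalences[s] * pos_probs[s] for s in pos)
    correct_neg = sum(prevalences[s] * (1 - pos_probs[s]) for s in neg)
    any_pos_rating = sum(prevalences[s] * pos_probs[s] for s in range(c))
    any_neg_rating = sum(prevalences[s] * (1 - pos_probs[s])
                         for s in range(c))
    return {
        "Se": hit / sum(prevalences[s] for s in pos),
        "Sp": correct_neg / sum(prevalences[s] for s in neg),
        "PV+": hit / any_pos_rating,
        "PV-": correct_neg / any_neg_rating,
    }


# Hypothetical three-class model: classes 1 and 2 negative, class 3 positive.
idx = accuracy_indices([0.80, 0.15, 0.05], [0.02, 0.30, 0.90], n_negative=2)
false_negative_rate = 1 - idx["Se"]
false_positive_rate = 1 - idx["Sp"]
```

With a rare positive class, as in these made-up values, sensitivity can be high while positive predictive validity remains modest, since most positive ratings come from the far more numerous negative cases.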

Interpreting Multiple Opinions 

One of the most useful features of the latent class approach to
analyzing agreement is that it leads directly to methods for the
interpretation and integration of opinions by multiple raters. Again,
let latent classes be assumed to be varieties of negative and positive
cases. Simple Bayesian calculations show that the probability of a case
being positive, i.e., belonging to one of the positive latent classes,
given exactly j out of k positive ratings, is equal to

$$\Pr[+ \mid j' = j] = \frac{\sum_{s=b}^{c} \pi_s \pi_{1|s}^{\,j} (1 - \pi_{1|s})^{k-j}}{\sum_{s=1}^{c} \pi_s \pi_{1|s}^{\,j} (1 - \pi_{1|s})^{k-j}}, \qquad (6)$$

where j' is a variable to denote the number of positive ratings observed
for a case. Subtracting this from one, the probability of a case being
negative given j out of k positive ratings is obtained. This equation
can be used to classify cases in the original rating study as positive
or negative. By consideration of other values for k, it may also be
used to derive classification rules for future cases based on different
numbers of ratings.
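
The Bayesian calculation can be sketched in Python as the ratio of the positive classes' contribution to the total over all classes, each term being $\pi_s \pi_{1|s}^{j} (1 - \pi_{1|s})^{k-j}$. The function name `prob_positive` and the two-class parameter values are hypothetical, and the sketch assumes the first `n_negative` classes are the negative subtypes.

```python
def prob_positive(j, k, prevalences, pos_probs, n_negative):
    """Probability a case is positive given j of k positive ratings."""
    # The binomial coefficient cancels between numerator and denominator.
    terms = [pi_s * p ** j * (1 - p) ** (k - j)
             for pi_s, p in zip(prevalences, pos_probs)]
    return sum(terms[n_negative:]) / sum(terms)


# Hypothetical two-class model: class 1 negative, class 2 positive.
pi = [0.9, 0.1]
p = [0.05, 0.8]
posteriors = [prob_positive(j, 3, pi, p, n_negative=1) for j in range(4)]
```

For these made-up parameters the probability of a positive case rises steadily with the number of positive ratings out of three, which is how such a table could be turned into a classification rule.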



Number of Opinions Necessary for Required Accuracy 

The above formula is easily applied to determine the number of 
opinions necessary to insure a required degree of classification 
accuracy. Suppose, for example, that a sufficiently accurate 
classification is defined as one with a certain positive predictive 
validity. One may then, for example, ask what the minimum number of 
ratings is such that the probability of a case being positive, given 
unanimous positive ratings, is greater than or equal to this value. The 
situation of unanimous positive ratings may be seen as a special case of 
the above where j = k. Thus this formula can be used to estimate the
positive predictive validity of unanimous positive ratings by panels of
1 rater, 2 raters, etc. The minimum panel size necessary to classify a
case positive with the required accuracy would then be the smallest
number needed for (6) to exceed the criterion established. By extension
of this reasoning, one may allow for nonunanimous panel outcomes or use
other criteria for minimal required accuracy in determining panel size. 
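
A small Python sketch of the panel-size search: it evaluates the probability of a positive case given unanimous positive ratings (j = k) for k = 1, 2, ... and returns the first panel size that meets the criterion. The function names, the search limit `max_k`, and the parameter values are hypothetical.

```python
def prob_positive_unanimous(k, prevalences, pos_probs, n_negative):
    """Probability a case is positive given k of k positive ratings."""
    terms = [pi_s * p ** k for pi_s, p in zip(prevalences, pos_probs)]
    return sum(terms[n_negative:]) / sum(terms)


def min_unanimous_panel(prevalences, pos_probs, n_negative,
                        required_pv, max_k=50):
    """Smallest k whose unanimous positive verdict meets required_pv."""
    for k in range(1, max_k + 1):
        if prob_positive_unanimous(k, prevalences, pos_probs,
                                   n_negative) >= required_pv:
            return k
    return None  # criterion not reached within max_k ratings


# Hypothetical two-class model: class 1 negative, class 2 positive.
pi = [0.9, 0.1]
p = [0.05, 0.8]
k_for_95 = min_unanimous_panel(pi, p, 1, required_pv=0.95)
k_for_99 = min_unanimous_panel(pi, p, 1, required_pv=0.99)
```

The same loop could be adapted to nonunanimous rules, for example by requiring at least some number of positive ratings out of k, at the cost of summing over the qualifying outcomes.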

We have shown how parameter estimates obtained with the latent 
class agreement model can be used to estimate rater accuracy, 
probabilistically interpret opinions by multiple raters, and determine 
an appropriate number of opinions for a sufficiently accurate 
classification. Of necessity, we consider only some of the applications 
possible. Many others are implicit in the ability of these methods to 
provide direct or upper bound estimates of rating accuracy. For 
example, estimated rater accuracy can be used to determine the expected 
attenuation in statistical power of comparisons that involve groups 
whose members are assigned on the basis of fallible ratings [37], 
estimate the decrease in apparent accuracy of a diagnostic test compared 
to a criterion diagnosis that is itself unreliable [38], or correct for 
bias in estimation of disease prevalence due to misclassification error
[39]. 

Software 

Varying panel latent class agreement models can be estimated with 
the PANEL microcomputer program. We document this program in a
companion RAND Note [32]. 

Varying Number of Ratings per Case

The varying panel latent class model may be generalized to designs
where cases are rated different numbers of times. Examples of this
occur when some ratings are lost or only some cases in a study are
multiply rated. Let K be the maximum number of ratings any case
receives. We may summarize results of an agreement study as the
proportion of cases that are rated k (k = 1, 2, ..., K) times and
receive j (j = 0, 1, ..., k) positive ratings. The EM algorithm may
again be used to obtain maximum likelihood estimates of $\pi_s$ and
$\pi_{1|s}$ parameters.

Pearson and likelihood ratio chi-square statistics can be used to 
test model fit. These may be viewed as the sum of separate chi-squares 
for cases with each number of ratings. Assessment of statistical 
significance for a test of model fit is complicated by the fact that 
outcomes for various values of k could not be interpreted as resulting
from independent multinomials. To test statistical significance
requires that a common multinomial be estimated. It is not difficult to
see how this can be done. Given an underlying multinomial for results
with K ratings, expectations of results with k < K ratings are obtained
using a formula related to the hypergeometric distribution (Uebersax
[26], Equation 6). Thus, a likelihood function may be constructed for
results across all values of k given probabilities for the K-way
multinomial. These probabilities may then be estimated from observed 
data using a numerical procedure such as the Newton-Raphson method. An 
analytic method for estimating the common underlying multinomial may 
also be possible. Chi-square statistics are calculated by comparison of 
the proportions of cases with various combinations of k and j expected
given the latent class model with those expected given the multinomial 
model. The degrees of freedom for this test are equal to K minus the 
number of estimated latent class model parameters. 

The assumption of a common multinomial is not necessary, however, 
to use the difference likelihood ratio chi-square statistic for 
comparison of alternative latent class models. For nested models, this 
may be calculated and tested for significance as before, with degrees of 
freedom equal to the difference in the number of estimated parameters. 
The normed fit index may also be calculated and used as before. 






Example 

We illustrate these methods with the Yerushalmy data previously
shown in Table 1.1. Three models, with two, three, and four latent
classes, designated $M_2$, $M_3$, and $M_4$, are estimated.* Table 2.2
contains expected frequencies for each model. The correspondence of
expected and observed frequencies is seen to increase with the number
of latent classes. Fit indices are shown in Table 2.3. The two-class
model does not fit well. Chi-square statistics for the three-class
model are statistically significant, suggesting lack of fit, but this
is partly due to the sample size. The likelihood ratio chi-square for a
one-class independence model is 7160.808, resulting in a normed fit
index for $M_3$ of 0.997, so that, by this criterion, $M_3$ does provide
good fit. Model $M_4$ fits the data better than $M_3$ by a statistically
significant degree (difference of 21.897 - 0.099 = 21.798, with
3 - 1 = 2 df), but, again, this is virtually guaranteed by the large
sample size.
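
The fit comparisons in this example can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the normed fit index is the proportional reduction in the likelihood ratio chi-square relative to the one-class independence model (an assumption that reproduces the values in Table 2.3), and it uses the closed-form upper tail of a chi-square variate with exactly 2 df. The function names are hypothetical.

```python
import math


def normed_fit_index(l2_null, l2_model):
    """Proportional reduction in L2 relative to the one-class model (assumed definition)."""
    return (l2_null - l2_model) / l2_null


def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-square variate; valid for 2 df only."""
    return math.exp(-x / 2.0)


# Likelihood ratio chi-squares reported for the Yerushalmy data.
L2_NULL = 7160.808
L2 = {"M2": 528.495, "M3": 21.897, "M4": 0.099}

nfi = {name: normed_fit_index(L2_NULL, value) for name, value in L2.items()}

# Difference test of the nested models M3 and M4 (3 - 1 = 2 df).
difference = L2["M3"] - L2["M4"]
p_value = chi2_upper_tail_2df(difference)

for name in ("M2", "M3", "M4"):
    print(name, round(nfi[name], 3))
print("difference L2:", round(difference, 3), "p =", p_value)
```

With a sample this large the difference test rejects easily, which is why the text leans on the normed fit index rather than on significance alone.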

Table 2.2

OBSERVED AND EXPECTED RESULTS FOR YERUSHALMY RATING DATA

Number of                     Expected Frequency $e_j$, by Model
Positive      Observed
Ratings j     Frequency       $M_2$            $M_3$            $M_4$

    0           13560       13452.90        13557.27        13559.99
    1             877        1090.14          883.24          877.02
    2             168          45.27          146.65          167.91
    3              66          25.08           92.25           66.29
    4              42          55.10           42.24           41.25
    5              28          79.94           16.39           29.05
    6              23          72.49           21.[illegible]  22.13
    7              39          37.56           50.[illegible]  39.64
    8              64           8.51           56.[illegible]  63.73

*We estimate these models using the PANEL program.

Table 2.3

FIT OF ALTERNATIVE LATENT CLASS MODELS
OF YERUSHALMY DATA

                                                        Normed
                      Pearson       Likelihood Ratio    Fit
Model       df        Chi-Square    Chi-Square          Index

$M_2$        5         874.201        528.495           0.926
$M_3$        3          22.473         21.897           0.997
$M_4$        1           0.099          0.099           1.000


Table 2.4

PARAMETER ESTIMATES FOR THREE-CLASS MODEL
OF YERUSHALMY DATA

                                    Conditional Positive
Latent Class      Prevalence        Rating Probability
     s            $\pi_s$           $\pi_{1|s}$

     1             0.9636             0.0072
                  (0.0027)           (0.0003)
     2             0.0275             0.2660
                  (0.0024)           (0.0177)
     3             0.0088             0.9003
                  (0.0008)           (0.0134)

NOTE: Standard errors are shown below estimates in parentheses.

We accordingly focus our attention on $M_3$ (Table 2.4). For
illustration, we assume that the three latent classes consist of two
negative classes and one positive class with respect to tuberculosis,
for example, (1) unaffected cases, (2) cases with less serious
conditions that have an elevated probability of being diagnosed
positive, and (3) cases with tuberculosis. Since there is only one
positive latent class, Equation (2) reduces to make rating sensitivity
equal to $\pi_{1|3}$, estimated as 0.9003. From Equation (3), rating
specificity is estimated as [(0.9636)(1 - 0.0072) + (0.0275)(1 -
0.2660)]/(0.9636 + 0.0275) = 0.986.* From Equations (4) and (5), the
positive predictive validity of diagnosis is estimated as approximately
0.357, and negative predictive validity as 0.999.

We also use parameter estimates to determine probable diagnostic
status given combinations of positive and negative ratings. From
Equation (6) we estimate the probability of tuberculosis given one
positive and one negative rating as 0.061. Since the probability of
tuberculosis given one positive rating, Pv+, is estimated as 0.357, we
see the difference that a second negative rating makes.** Similarly,
given five positive and three negative ratings, Equation (6) results in
an estimated probability of 0.263 of a positive case.

Finally, we consider how many opinions are necessary to make a 
diagnosis with required accuracy. Suppose that we define sufficient 
accuracy as a positive predictive validity of at least 0.90. Use of 
Equation (6) results in estimated predictive validities of 0.781, 0.925, 
and 0.977 for a positive diagnosis based on unanimous positive ratings 
by two, three, and four diagnosticians, respectively. We would 
therefore need a minimum of three ratings to obtain the necessary 
accuracy. 
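
The calculations in this example can be reproduced from the Table 2.4 estimates alone. The sketch below assumes that Equation (6) has the usual Bayes form, in which each latent class is weighted by its prevalence times the probability of the observed ratings and the binomial coefficient cancels between numerator and denominator. The function names are hypothetical, and class 3 is taken as the only positive class, as in the text.

```python
# Parameter estimates for the three-class model (Table 2.4).
PREVALENCE = (0.9636, 0.0275, 0.0088)
P_POSITIVE = (0.0072, 0.2660, 0.9003)
POSITIVE_CLASSES = (2,)  # zero-based index of latent class 3 (tuberculosis)


def prob_positive_case(n_pos, n_neg):
    """Probability that a case is positive given n_pos positive and n_neg negative ratings."""
    weights = [
        pi * p ** n_pos * (1.0 - p) ** n_neg
        for pi, p in zip(PREVALENCE, P_POSITIVE)
    ]
    return sum(weights[s] for s in POSITIVE_CLASSES) / sum(weights)


def min_unanimous_panel(criterion, max_size=20):
    """Smallest panel whose unanimous positive rating meets the criterion, or None."""
    for k in range(1, max_size + 1):
        if prob_positive_case(k, 0) >= criterion:
            return k
    return None


print(round(prob_positive_case(1, 0), 3))  # one positive rating
print(round(prob_positive_case(1, 1), 3))  # one positive, one negative
print(round(prob_positive_case(5, 3), 3))  # five positive, three negative
print([round(prob_positive_case(k, 0), 3) for k in (2, 3, 4)])
print(min_unanimous_panel(0.90))
```

Changing the criterion passed to `min_unanimous_panel` shows how the required panel size responds to a stricter or looser accuracy requirement.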

We have thus shown how, by the latent class approach, agreement
data can be used to address the practical questions concerning ratings 
initially posed. We next consider a version of these methods applicable 
to fixed panel designs. 



*We base calculated values on four-place accuracy of parameter
estimates; rounding error may therefore occur.

**This illustrates the practical value of the latent class modeling
approach. One would of course expect a lower probability of a disorder
given that the second opinion is negative. But without such an approach
it would not be possible to determine by how much it is reduced.

FIXED RATING PANEL

In a fixed panel design the same raters are used to rate each case,
corresponding to what is also commonly called a fully crossed rating
design. This design is useful in that it usually requires fewer raters,
and provides information about the comparative accuracy of individual
raters.

Discussions of fixed panel latent class agreement models may be 
found in Bergan [40], Clogg [41], Dawid and Skene [31], Dillon and
Mulani [33], Espeland and Handelman [34], Uebersax and Grove [42], and
Walter and Irwig [43]. The fixed panel agreement model corresponds
closely to traditional latent class analysis applications as described 
by Lazarsfeld and Henry [2], Goodman [3], and Haberman [4]. 

Model 

We again assume that each of a sample of N cases is rated by a
panel consisting of k raters. We now assume, though, that the raters
are the same for each case, and are numbered j = 1, 2, ..., k.

Let a positive rating again be represented by 1 and a negative
rating by 0. Let the vector $u_i$ be one of $I$ ($I = 2^k$) unique patterns of
positive and negative ratings (see Table 2.6), whose jth element, $u_{ij}$,
corresponds to the rating of the jth rater. As before, let s denote one
of c latent classes to which a case may belong, and let the prevalence
of latent class s be $\pi_s$.

Again, latent classes are defined such that all cases belonging to 
the same latent class have the same probability of eliciting a positive 
rating. However, we now allow this probability to be different for each 
rater. To accommodate this, a slightly different notation is adopted. 
Specifically, let $\pi_{1|sj}$ (s = 1, 2, ..., c; j = 1, 2, ..., k) be the
conditional probability of a case belonging to latent class s being
rated positive by rater j.

Given $\pi_s$ and $\pi_{1|sj}$ parameters, we may calculate the joint
probability of a case being a member of latent class s and receiving
rating pattern $u_i$. This, denoted by $\pi_{is}$, is calculated as

$$\pi_{is} = \pi_s \prod_{j=1}^{k} \pi_{1|sj}^{\,u_{ij}} \, (1 - \pi_{1|sj})^{1 - u_{ij}}. \qquad (7)$$

The exponents $u_{ij}$ and $1 - u_{ij}$ function such that either $\pi_{1|sj}$ or
$1 - \pi_{1|sj}$ is counted in calculating the joint probability, depending on
whether the jth rater's rating is positive or negative. The expected
frequency of each rating pattern, $e_i$ (i = 1, 2, ..., I), is then given
by

$$e_i = N \sum_{s=1}^{c} \pi_{is}. \qquad (8)$$
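
As a rough illustration of how the joint pattern probabilities and expected pattern frequencies are built up, the sketch below enumerates all rating patterns for a small panel. The prevalences, conditional rating probabilities, and sample size are hypothetical values chosen only for the example (two latent classes, three raters), and the function names are not taken from any program cited here.

```python
from itertools import product


def joint_probability(pattern, prevalence_s, p_pos_s):
    """Class prevalence times the product, over raters, of the probability of each rating."""
    prob = prevalence_s
    for rating, p in zip(pattern, p_pos_s):
        # A positive rating (1) contributes p; a negative rating (0) contributes 1 - p.
        prob *= p if rating == 1 else 1.0 - p
    return prob


def expected_frequencies(n_cases, prevalence, p_pos):
    """Expected frequency of every rating pattern, summing joint probabilities over classes."""
    n_raters = len(p_pos[0])
    return {
        pattern: n_cases * sum(
            joint_probability(pattern, prevalence[s], p_pos[s])
            for s in range(len(prevalence))
        )
        for pattern in product((1, 0), repeat=n_raters)
    }


# Hypothetical parameters: rows of p_pos are latent classes, columns are raters.
prevalence = [0.8, 0.2]
p_pos = [
    [0.05, 0.10, 0.02],
    [0.90, 0.85, 0.95],
]
expected = expected_frequencies(1000, prevalence, p_pos)

for pattern, freq in expected.items():
    print(pattern, round(freq, 2))
```

Because the patterns are exhaustive and mutually exclusive, the expected frequencies sum to the number of cases, which is a convenient check on any implementation.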



Estimation, Identifiability, and Assessment of Model Fit 

The results of ratings by k raters across cases may be summarized 
by the number of times each rating pattern occurs, i.e., a set of 
observed frequencies, $f_i$ (i = 1, 2, ..., I). The purpose of estimation
is to obtain estimates for $\pi_s$ and $\pi_{1|sj}$ parameters that lead to expected
frequencies as close as possible to observed frequencies. Again, the EM
algorithm can be used to obtain maximum likelihood estimates. 

The subject of identifiability for this class of models is fully
discussed by Goodman [3] in the context of the general latent class
model. As in the varying panel case, there are c - 1 prevalence
parameters, but there are ck conditional rating probability parameters
(one for each combination of rater and latent class), making the total
number requiring estimation c(k + 1) - 1. For unconstrained models, a
unique solution therefore requires that $I \geq c(k + 1)$. Again, this is a
necessary but not a sufficient condition. In the case of two latent
classes (see Table 2.5), three raters are required, which is consistent
with this formula. For a three-class model, however, five raters are
required, even though this formula suggests that four would be enough.
The general method for establishing model identifiability is by
evaluating the rank of the matrix of derivatives of pattern
probabilities with respect to model parameters [3], or the rank of the
matrix of second derivatives of the log-likelihood function with respect
to model parameters. This test is automatically performed by standard
latent class analysis programs. If a model is found to be not
identifiable, the number of estimated parameters must be reduced, either
by decreasing the number of latent classes or by imposing constraints on
parameters.

Table 2.5

MINIMUM RATINGS PER CASE REQUIRED FOR MODEL IDENTIFIABILITY
AND ASSESSMENT OF FIT: FIXED PANEL DESIGN

                  Number of Raters      Number of Raters
Number of         Required for Model    Required for Chi-
Latent Classes    Identifiability       Square Test          df

      1                  1                     2              1
      2                  3                     4              6
      3                  5                     5             14
      4                  5                     5              8

NOTE: Assumes unconstrained model; degrees of freedom are
for $X^2$ or $L^2$ test with minimum required number of raters.
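
The parameter-counting argument can be made concrete with a short sketch. It assumes an unconstrained model with dichotomous ratings, so that k raters give 2^k rating patterns, and it takes the degrees of freedom to be the number of patterns minus 1 minus the number of estimated parameters. The function names are hypothetical. The counting condition is only necessary, so the sketch cannot by itself establish identifiability.

```python
def n_parameters(n_classes, n_raters):
    """(c - 1) prevalences plus c * k conditional rating probabilities."""
    return n_classes * (n_raters + 1) - 1


def n_patterns(n_raters):
    """Number of distinct patterns of dichotomous ratings by k raters."""
    return 2 ** n_raters


def passes_counting_condition(n_classes, n_raters):
    """Necessary (not sufficient) condition: patterns >= c(k + 1)."""
    return n_patterns(n_raters) >= n_classes * (n_raters + 1)


def min_raters_by_count(n_classes, max_raters=30):
    """Smallest panel size that passes the counting condition, or None."""
    for k in range(1, max_raters + 1):
        if passes_counting_condition(n_classes, k):
            return k
    return None


def degrees_of_freedom(n_classes, n_raters):
    """Patterns minus 1 minus estimated parameters, for an unconstrained model."""
    return n_patterns(n_raters) - 1 - n_parameters(n_classes, n_raters)


for c in (1, 2, 3, 4):
    print(c, min_raters_by_count(c))
print(degrees_of_freedom(2, 4), degrees_of_freedom(3, 5), degrees_of_freedom(4, 5))
```

For three latent classes the counting condition is met with four raters, yet five are actually required, which is why the rank test described in the text is still needed.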

As with the varying panel model, one useful type of constraint is
to require certain parameters to be equal to fixed values. Another
useful constraint for fixed panel models is to require that certain
conditional rating probabilities be equal, e.g., that the values of
$\pi_{1|sj}$ be the same across raters for a particular latent class.

Model fit is again assessed with the $X^2$ or $L^2$ chi-square statistic.
The formulas are the same, except that the number of observed and
expected frequencies now equals I, and the degrees of freedom for the
statistics now equal I - 1 minus the number of estimated parameters.
The normed fit index may also be calculated as before.

Applications

Parameter estimates can again be used to estimate rater accuracy.
We again assume that latent classes are interpretable as varieties of
negative and positive cases, latent classes 1 through a corresponding to
negative cases and latent classes b through c to positive cases, and
understand that when such a simple differentiation of latent classes is
not possible the procedures described below may be suitably modified.

The accuracy of individual raters may be expressed using the
indices discussed earlier, Se, Sp, Pv+, and Pv-, subscripts being added
to denote values for each rater. These are obtained from Equations (2)
through (5), with estimates of $\pi_{1|sj}$ used in place of those of $\pi_{1|s}$.
Resulting values may also be averaged across raters, providing mean
accuracy indices.

One may again use parameter estimates to classify cases based on 
multiple ratings. Recalling the definition of $\pi_{is}$ as the joint
probability of a case belonging to latent class s and receiving rating
pattern $u_i$, the probability of a case being positive given this pattern
is

$$\Pr[+ \mid u_i] = \frac{\sum_{s=b}^{c} \pi_{is}}{\sum_{s=1}^{c} \pi_{is}}. \qquad (9)$$

An important aspect of Equations (7) and (9) is that they lead to
different probabilities of a case being positive depending on which
raters make positive and negative ratings. We discuss implications of
this in the example below. 

Software 

The fixed panel latent class agreement model can be implemented 
using standard latent class analysis programs such as Clogg's MLLSA [36] 
and Haberman's LAT [4]. The PANMARK program of van de Pol, Langeheine, 
and de Jong [44], though primarily intended for Markov model analysis,
can also be used for these models. All of these programs are available
in microcomputer form.

Example 

It is useful to consider the fixed panel model in an application
other than diagnosis, since, in fact, the applicability of these methods
extends far beyond that context. We consider ratings on the
appropriateness of 859 possible indications for performing the procedure
carotid endarterectomy by a panel of medical experts, gathered in a
study described by Park et al. [45]. For present purposes, we recode
ratings, originally made on a nine-point Likert-type scale (1 =
extremely inappropriate indication; 9 = extremely appropriate
indication), to dichotomies, a positive rating corresponding to a judged
indication and a negative rating to a nonindication. The observed
frequencies of all possible rating patterns among five raters are shown
in Table 2.6. We consider models with two, three, and four latent
classes, designated $M_2$, $M_3$, and $M_4$. The expected pattern frequencies
given each model are also shown in Table 2.6. Fit indices for each
model are shown in Table 2.7. The $X^2$ and $L^2$ statistics for both $M_3$ and
$M_4$ are nonsignificant, indicating good fit.

A one-class independence model yields a value of 1433.925 for $L^2$.
Using this to calculate the normed fit index, we see that $M_3$ and $M_4$ also
provide good fit by this criterion. With a difference likelihood ratio
chi-square of 23.059 - 7.534 = 15.525 (16 - 14 = 2 df), the fit of $M_4$ is
better than that of $M_3$ by a statistically significant amount, but this
must be weighed against the greater parsimony of $M_3$.

Parameter estimates for $M_3$ are shown in Table 2.8.* To see how
these might be used, suppose that the three latent classes are (1)
nonindications, (2) equivocal indications, and (3) valid indications for
treatment, and that of interest is, for each rater, the probability of a
positive rating given a valid indication, or each rater's sensitivity.

*Parameter estimates shown are from the MLLSA program. The input
file used to generate these results is shown in the Appendix. Standard
errors shown are from the PANMARK program.

Table 2.6

OBSERVED AND EXPECTED RESULTS FOR PHYSICIAN RATINGS
OF TREATMENT APPROPRIATENESS

Rating                            Observed     Expected Frequency $e_i$, by Model
Pattern          Rater            Frequency
   i       1   2   3   4   5        $f_i$        $M_2$       $M_3$       $M_4$

   1       +   +   +   +   +         69         35.52      69.25      68.94
   2       +   +   +   +   -          2          3.42       1.85       2.47
   3       +   +   +   -   +          4          9.14       4.36       4.43
   4       +   +   +   -   -          1          0.88       0.17       0.16
   5       +   +   -   +   +          2         20.20       2.11       2.24
   6       +   +   -   +   -          1          1.96       0.25       0.76
   7       +   +   -   -   +          0          5.23       0.59       0.00
   8       +   +   -   -   -          0          0.51       0.14       0.00
   9       +   -   +   +   +         82        102.93      80.75      81.13
  10       +   -   +   +   -          4         10.11       9.90       4.70
  11       +   -   +   -   +         23         26.91      23.69      25.29
  12       +   -   +   -   -          8          4.34       6.52       6.92
  13       +   -   -   +   +         67         59.59      63.80      66.71
  14       +   -   -   +   -         24         10.33      19.50      24.31
  15       +   -   -   -   +         42         23.65      45.72      40.81
  16       +   -   -   -   -         41         55.48      41.41      41.14
  17       -   +   +   +   +          0          1.08       0.04       0.00
  18       -   +   +   +   -          0          0.10       0.01       0.00
  19       -   +   +   -   +          0          0.28       0.03       0.00
  20       -   +   +   -   -          0          0.03       0.01       0.00
  21       -   +   -   +   +          0          0.62       0.09       0.00
  22       -   +   -   +   -          0          0.06       0.02       0.00
  23       -   +   -   -   +          0          0.16       0.06       0.00
  24       -   +   -   -   -          0          0.02       0.02       0.00
  25       -   -   +   +   +          5          3.30       3.56       2.61
  26       -   -   +   +   -          0          1.34       1.51       1.14
  27       -   -   +   -   +          8          2.69       3.32       8.66
  28       -   -   +   -   -          8         12.13       9.04       7.57
  29       -   -   -   +   +          5          6.74       9.95       7.02
  30       -   -   -   +   -         28         31.92      26.41      26.98
  31       -   -   -   -   +         49         58.16      48.69      48.17
  32       -   -   -   -   -        386        370.39     386.25     386.86

SOURCE: Park et al. [45].

NOTE: Total N of 859. Columns may not sum to total due to
rounding.

Table 2.7

FIT OF ALTERNATIVE LATENT CLASS MODELS OF
TREATMENT APPROPRIATENESS RATINGS

                                                        Normed
                      Pearson       Likelihood Ratio    Fit
Model       df        Chi-Square    Chi-Square          Index

$M_2$       21         126.347        130.496           0.909
$M_3$       16          24.085         23.059           0.984
$M_4$       14           9.248          7.534           0.995

NOTE: Degrees of freedom shown are obtained from the
MLLSA program, which treats parameter estimates of 0 or 1
as constrained, reducing the number considered estimated.

Table 2.8

PARAMETER ESTIMATES FOR THREE-CLASS MODEL
OF TREATMENT APPROPRIATENESS RATINGS

Latent                          Conditional Positive Rating Probability
Class    Prevalence
  s      $\pi_s$       $\pi_{1|s1}$   $\pi_{1|s2}$   $\pi_{1|s3}$   $\pi_{1|s4}$   $\pi_{1|s5}$

  1       0.5838        0.0712        0.0000        0.0213        0.0596        0.1023
         (0.0219)      (0.0183)                    (0.0081)      (0.0121)      (0.0165)
  2       0.2625        0.8972        0.0118        0.3277        0.5967        0.7805
         (0.0224)      (0.0341)      (0.0154)      (0.0565)      (0.0497)      (0.0440)
  3       0.1537        1.0000        0.5783        0.9806        0.9437        0.9752
         (0.0212)                    (0.0710)      (0.0274)      (0.0285)      (0.0183)

NOTE: Standard errors are shown in parentheses. Estimates of 1 or 0
indicate convergence to a boundary value [3] (see also NOTE for previous
table); for these, standard errors are not calculated.



These are equal to the estimates of $\pi_{1|3j}$ shown in Table 2.8. Thus,
estimated rater sensitivities, $Se_1$, $Se_2$, $Se_3$, $Se_4$, and $Se_5$, of 1.0000,
0.5783, 0.9806, 0.9437, and 0.9752, are obtained. Following the
procedure for calculating positive predictive validity, we obtain
estimates of 0.357, 0.966, 0.605, 0.431, and 0.362 for $Pv+_1$, $Pv+_2$,
$Pv+_3$, $Pv+_4$, and $Pv+_5$.

From Equations (7) and (9) we see that the probability of a
possible indication being a true indication given five positive ratings
is 0.995. Suppose, however, that of the five ratings, four are positive
and one is negative. The probability of a true indication now depends
on which rater makes the negative rating: if it is Rater 4, for
example, we obtain an estimate of 0.943; however, if it is Rater 2, we 
estimate the probability as only 0.622. 
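
These pattern-specific probabilities follow directly from the Table 2.8 estimates. The sketch below is an illustration rather than the procedure of any program named above: it takes latent class 3 (valid indications) as the only positive class, computes the joint probability of each class and rating pattern as prevalence times the product of the raters' conditional probabilities, and normalizes. The function names are hypothetical.

```python
# Parameter estimates for the three-class model (Table 2.8).
PREVALENCE = (0.5838, 0.2625, 0.1537)
P_POS = (
    (0.0712, 0.0000, 0.0213, 0.0596, 0.1023),  # class 1: nonindications
    (0.8972, 0.0118, 0.3277, 0.5967, 0.7805),  # class 2: equivocal indications
    (1.0000, 0.5783, 0.9806, 0.9437, 0.9752),  # class 3: valid indications
)
POSITIVE_CLASSES = (2,)  # zero-based index of class 3


def joint(pattern, s):
    """Joint probability of latent class s and a rating pattern (1 = positive, 0 = negative)."""
    prob = PREVALENCE[s]
    for rating, p in zip(pattern, P_POS[s]):
        prob *= p if rating == 1 else 1.0 - p
    return prob


def prob_positive(pattern):
    """Probability of a true indication given the full rating pattern."""
    total = sum(joint(pattern, s) for s in range(len(PREVALENCE)))
    return sum(joint(pattern, s) for s in POSITIVE_CLASSES) / total


def rater_ppv(j):
    """Positive predictive validity of a single positive rating by rater j (zero-based)."""
    total = sum(PREVALENCE[s] * P_POS[s][j] for s in range(len(PREVALENCE)))
    return sum(PREVALENCE[s] * P_POS[s][j] for s in POSITIVE_CLASSES) / total


print(round(prob_positive((1, 1, 1, 1, 1)), 3))  # five positive ratings
print(round(prob_positive((1, 1, 1, 0, 1)), 3))  # Rater 4 negative
print(round(prob_positive((1, 0, 1, 1, 1)), 3))  # Rater 2 negative
print([round(rater_ppv(j), 3) for j in range(5)])
```

The contrast between the second and third results arises because Rater 2 almost never rates a nonindication or equivocal indication positive, so a negative rating from Rater 2 is far less informative against a true indication than one from Rater 4.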

It is by its ability to combine opinions in an explicit and 
probabilistically correct way that the fixed panel latent class 
agreement model demonstrates perhaps its greatest value relative to 
traditional ways of interpreting panel ratings. For example, the
non-Bayesian view might hold a rating pattern of {+, -, +, +, +} to just as
strongly indicate a positive case as a pattern of {+, +, +, -, +}.
However, this is neither probabilistically correct, nor necessarily the
way we really interpret multiple opinions. If one rater tends to make
positive or negative ratings more often than others, we are likely to
take this information into account. All other things being equal, a
positive rating by a conservative rater gives us greater cause to
believe that a case is positive than one by a nonconservative rater.
An important limitation of traditional methods for interpreting multiple 
rater opinions is that they do not take this into account. 

This also suggests why it may be useful to include in panels both 
conservative and nonconservative raters. If the need arises to identify
a positive case with a high degree of certainty, one may be selected
that even conservative raters rate as positive. Conversely, negative
ratings by nonconservative raters may be useful when there is a need to
identify a case as negative with a high degree of certainty.

**An interesting consequence of this is the opportunity it provides
for "gaming" by raters. For example, if a rater wanted to ensure that a
positive rating carried the most weight, it would be advantageous to
make positive ratings sparingly up to that point, thus appearing
conservative. A positive rating would then be interpreted as stronger
evidence of a positive case.

Again, we have considered only some of the applications that the
fixed panel latent class agreement model permits.

DIRECTIONS FOR FUTURE RESEARCH 

We believe that the methods described in this section offer many 
advantages for the analysis and interpretation of rater agreement, and 
recommend their use. One possible concern is the assumption of cases as 
falling into only a small number of latent classes. Although it would 
naturally be more appealing to think of disorders as displaying instead 
continuously varying levels of a latent trait, latent class models 
appear to provide a suitable approximation for a large number of 
applications. 

There are several areas where additional research would be helpful. 
The extent to which sample size affects the accuracy of parameter 
estimation needs to be investigated; simulation studies may prove 
helpful in determining this. Generalizations of these methods may also 
increase their range of application. It should be possible to adapt the 
fixed panel model to allow for missing observations, or rotating panel 
designs where the raters rating each case are systematically varied. We 
have considered only dichotomous ratings here, but latent class models 
for polytomous ratings have also been discussed [31, 33, 40, 41, 46].
Latent class models can also be used to analyze agreement on ordered 
response category or Likert-type ratings [47].

III. LATENT TRAIT AGREEMENT ANALYSIS

The methods in this section are related to the statistical
techniques of item response theory [22, 48] and Rasch modeling [49],
which together may be subsumed under the more general heading of latent
trait analysis [2]. We term the use of these methods in the analysis of
agreement Latent Trait Agreement Analysis. Related discussions may be
found in Fleiss [50] and Kraemer [37], and Quinn [51] has recently shown
that equivalent models may be derived from signal detection theory [52].

VARYING RATING PANEL
Model

We begin by assuming a continuous dimension of trait intensity or
severity. The location of a case on this continuum we term its latent
trait level, and denote by $\theta$. The word "trait" is used broadly, and it
is understood that the continuum may also be an aggregate dimension
based on several traits or symptoms.

The latent trait agreement model may be understood in terms of two
functions (Fig. 3.1). The first, $f(\theta)$, describes the probability of
encountering a case at each latent trait level $\theta$. The second, $p(\theta)$,
describes the probability, given a case at level $\theta$, of a positive
rating. We term $f(\theta)$ the trait probability density function, and $p(\theta)$
the probability of positive rating (or diagnosis) function.

The probability of a randomly selected case being rated positive is
equal to a weighted average of $p(\theta)$ over all levels of $\theta$, where the
weight is the probability of a case having trait level $\theta$, i.e., $f(\theta)$.
Thus, it is equal to the product of $f(\theta)$ times $p(\theta)$ summed across the
range of $\theta$, or the integral of $f(\theta)p(\theta)$ over all levels of $\theta$.

[Figure omitted; horizontal axis: Latent Trait Level]

Fig. 3.1--Varying panel latent trait agreement model:
trait probability density function, $f(\theta)$, and probability
of positive rating function, $p(\theta)$, given a continuum
of latent trait intensity or severity, $\theta$. Dotted lines
correspond to weighted (by prevalence) probability
functions of negative (left) and positive (right) cases.

If $f(\theta)$ and $p(\theta)$ were known, they would lead directly to estimates
for the probability of various patterns of agreement and disagreement by
multiple raters. For example, given a case at level $\theta$, the probability
of two positive ratings is $p(\theta)^2$; for a randomly selected case, this
probability is thus equal to the integral of $f(\theta)p(\theta)^2$ over all levels
of $\theta$. Similarly, the probability of two positive ratings followed by
one negative rating is equal to the integral of $f(\theta)p(\theta)^2[1 - p(\theta)]$
over all levels of $\theta$. Generalizing this, and counting the $\binom{k}{j}$
orders in which the ratings can occur, the probability of exactly j
positive ratings by k randomly selected raters is

$$\Pr[j' = j] = \binom{k}{j} \int f(\theta)\, p(\theta)^{j}\, [1 - p(\theta)]^{k-j}\, d\theta, \qquad (10)$$

where $j'$ is a variable to denote the number of positive ratings.
Multiplying $\Pr[j' = j]$ times the number of cases in an agreement study,
N, gives the expected number of cases with exactly j positive ratings,
$e_j$ (j = 0, 1, ..., k).

Knowing only $f(\theta)$ and $p(\theta)$, therefore, it is possible to predict
the results of a rater agreement study. Conversely, given certain
assumptions about their general forms, one can use the results of a
study to estimate these functions. What we propose, therefore, is as
follows: first, agreement data are used to estimate $f(\theta)$ and $p(\theta)$; then
these functions are used to estimate rater accuracy and provide a basis
for combining multiple ratings.

As a plausible way of approaching the initial estimation problem,
we begin by assuming that there are two types of cases constituting a
population, positive and negative cases, each normally distributed with
respect to the trait continuum. Specifically, let $f_1(\theta)$ and $f_2(\theta)$ be
normal distributions describing the unconditional probabilities of
negative and positive cases, respectively, occurring at each level of $\theta$,
with $f_1(\theta)$ being defined by mean $\mu_1$ and standard deviation $\sigma_1$, and
$f_2(\theta)$ by $\mu_2$ and $\sigma_2$. That is, $f_1(\theta)$ is the probability of sampling a
case that is both negative and at trait level $\theta$, and $f_2(\theta)$ is the
probability of sampling a case that is both positive and at trait level
$\theta$. These functions are not probability density functions per se, since
their integrals do not equal 1. Rather, they are the probability
density functions for negative and positive cases multiplied by their
corresponding prevalences. The sum of these functions, $f(\theta) =
f_1(\theta) + f_2(\theta)$, provides the latent trait probability density function.

Derivation of the Probability of Positive Rating Function

We now consider the function $p(\theta)$. Let each rater be assumed to
have a rating threshold, or some point along $\theta$ such that cases with a
trait level at or above this point are rated positive, and those below
rated negative. In the varying panel case, let the thresholds of a
population of raters be assumed normally distributed and described by
the probability density function $t(\theta)$, with mean $\mu_t$ and standard
deviation $\sigma_t$. The cumulative distribution function of $t(\theta)$ gives the
probability that the threshold of a randomly selected rater is at or
below each level of $\theta$. This is equivalent to the probability of a case
with trait level $\theta$ equaling or exceeding the threshold of a randomly
selected rater, and therefore being rated positive. Thus, this
cumulative distribution function is equal to the probability of positive
rating function, $p(\theta)$.

Estimation 

We have therefore developed a model which provides the general form
for $f(\theta)$ and $p(\theta)$. According to this model, $f(\theta)$ and $p(\theta)$ depend only
on the means and standard deviations of $f_1(\theta)$ and $f_2(\theta)$ ($\mu_1$, $\mu_2$,
$\sigma_1$, and $\sigma_2$), the mean and standard deviation of $t(\theta)$ ($\mu_t$ and
$\sigma_t$), and the prevalences of positive and negative cases (which we
designate P and 1 - P, respectively). Only one prevalence must be
estimated, since they sum to 1. Also, either of the means and either of
the standard deviations for $f_1(\theta)$ and $f_2(\theta)$ can be chosen arbitrarily.
Knowledge of as few as five parameters, therefore, allows estimation of
$f(\theta)$ and $p(\theta)$.

For a set of parameter values, we may determine the probability of
each number of positive ratings given k ratings per case with Equation
(10). Given observed frequencies $f_j$ (j = 0, 1, ..., k) for the number
of cases with each number of positive ratings, we then calculate the
log-likelihood of the joint outcome as

$$\log L = \sum_{j=0}^{k} f_j \log \Pr[j' = j]. \qquad (11)$$

From the results of a rater agreement study, therefore, we may use
numerical procedures to obtain maximum likelihood estimates for model
parameters. Specifically, the maximum likelihood estimates of model
parameters are those that maximize Equation (11). Uebersax [26]
described the use of the Newton-Raphson method to obtain estimates for
this model. For the Newton-Raphson procedure to converge effectively,
however, it is usually necessary to apply an initial grid-search
algorithm, which tests all combinations of parameter values using a
relatively coarse resolution, to find starting values in the vicinity of
maximum likelihood estimates.*
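
The role of the coarse grid search can be sketched as follows. This is an illustration under stated assumptions, not the estimation program described in [26]: the integral in Equation (10) is approximated by the trapezoidal rule, negative cases are fixed at mean 0 and standard deviation 1 to set the arbitrary origin and scale, the standard deviation of positive cases is also fixed at 1 for brevity, and all parameter values are hypothetical. "Observed" frequencies are generated from known parameter values so that the grid search has a known answer to recover.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def count_probabilities(k, prevalence, mu_pos, mu_t, sigma_t,
                        lo=-8.0, hi=11.0, n_grid=761):
    """Probability of each number of positive ratings (0..k), by the trapezoidal rule."""
    step = (hi - lo) / (n_grid - 1)
    probs = [0.0] * (k + 1)
    for i in range(n_grid):
        theta = lo + i * step
        weight = step * (0.5 if i in (0, n_grid - 1) else 1.0)
        # Prevalence-weighted densities of negative and positive cases.
        f = ((1.0 - prevalence) * normal_pdf(theta, 0.0, 1.0)
             + prevalence * normal_pdf(theta, mu_pos, 1.0))
        # Probability of a positive rating: CDF of the rater thresholds.
        p = normal_cdf(theta, mu_t, sigma_t)
        for j in range(k + 1):
            probs[j] += weight * f * math.comb(k, j) * p ** j * (1.0 - p) ** (k - j)
    return probs


def log_likelihood(freqs, probs):
    """Multinomial log-likelihood of observed counts, up to an additive constant."""
    return sum(f * math.log(p) for f, p in zip(freqs, probs))


K = 5
TRUE = dict(prevalence=0.2, mu_pos=3.0, mu_t=1.5, sigma_t=0.5)  # hypothetical values
observed = [1000.0 * p for p in count_probabilities(K, **TRUE)]

# Coarse grid over two of the parameters; the others are held at their true values.
grid = [(prev, mu_t) for prev in (0.1, 0.2, 0.3) for mu_t in (0.5, 1.0, 1.5, 2.0, 2.5)]
best = max(
    grid,
    key=lambda g: log_likelihood(
        observed,
        count_probabilities(K, g[0], TRUE["mu_pos"], g[1], TRUE["sigma_t"]),
    ),
)
print("best grid point (prevalence, mu_t):", best)
```

In practice the best grid point would serve only as a starting value for a finer optimizer such as Newton-Raphson, and the grid would span all free parameters.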

As in the previous section, model fit may be assessed by comparison
of observed and expected outcome frequencies using an $X^2$ or $L^2$ test, with
degrees of freedom equal to k minus the number of estimated parameters.

Estimating Rater Accuracy and Related Applications 

Knowledge of f(θ) and p(θ) and their component parameters permits
inferences concerning rating accuracy. For example, since the
probability of a case at trait level θ being rated positive is p(θ), the
conditional probability of a randomly selected positive case being rated
positive, i.e., Se, is equal to the integral of f_2(θ)p(θ) over all
levels of θ, divided by P. Similarly, Sp is equal to the integral of
f_1(θ)[1 - p(θ)] over all levels of θ, divided by 1 - P. Uebersax [26]
shows similar formulas for positive and negative predictive validity.
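The two integrals for Se and Sp can be evaluated numerically. The sketch below assumes that f_1(θ) and f_2(θ) are prevalence-weighted normal densities (so that they integrate to 1 - P and P) and that p(θ) is a logistic ogive. The function names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def logistic_ogive(theta, a, b):
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def sensitivity_specificity(P, mu1, sd1, mu2, sd2, a, b,
                            lo=-10.0, hi=12.0, steps=2200):
    """Se and Sp from f_1, f_2, and p(theta), by midpoint-rule integration."""
    h = (hi - lo) / steps
    pos_rated_pos = 0.0  # integral of f_2(theta) * p(theta)
    neg_rated_neg = 0.0  # integral of f_1(theta) * [1 - p(theta)]
    for s in range(steps):
        t = lo + (s + 0.5) * h
        p = logistic_ogive(t, a, b)
        f1 = (1.0 - P) * normal_pdf(t, mu1, sd1)
        f2 = P * normal_pdf(t, mu2, sd2)
        pos_rated_pos += f2 * p * h
        neg_rated_neg += f1 * (1.0 - p) * h
    return pos_rated_pos / P, neg_rated_neg / (1.0 - P)

# Hypothetical parameter values: threshold midway between the two case means.
se, sp = sensitivity_specificity(P=0.3, mu1=0.0, sd1=1.0,
                                 mu2=2.0, sd2=1.0, a=1.0, b=1.0)
print(round(se, 3), round(sp, 3))
```

With the threshold placed midway between two equal-variance distributions, Se and Sp coincide by symmetry, which gives a convenient check on the integration.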

Combining Multiple Opinions 

Once estimated, f(θ) and p(θ) can also be used to classify cases
based on multiple ratings. The probability of a case being positive,
given j out of k positive ratings, is

$$\Pr[+ \mid j' = j] = \frac{\int f_2(\theta)\, p(\theta)^{j} [1 - p(\theta)]^{k-j} \, d\theta}{\int f(\theta)\, p(\theta)^{j} [1 - p(\theta)]^{k-j} \, d\theta}. \tag{12}$$

We may also use Equation (12) with different values of k to derive
classification rules for future cases.
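Equation (12) can be tabulated for several panel sizes to see how many opinions are needed before the posterior probability crosses a chosen cutoff. The sketch below is illustrative only. It assumes prevalence-weighted normal densities for f_1(θ) and f_2(θ) and a logistic form for p(θ), and the function name and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def prob_positive(j, k, P, mu1, sd1, mu2, sd2, a, b,
                  lo=-10.0, hi=12.0, steps=2200):
    """Equation (12): Pr[+ | j of k ratings positive], by the midpoint rule."""
    h = (hi - lo) / steps
    num = den = 0.0
    for s in range(steps):
        t = lo + (s + 0.5) * h
        p = 1.0 / (1.0 + math.exp(-1.7 * a * (t - b)))  # assumed form of p(theta)
        like = p**j * (1.0 - p) ** (k - j)
        f1 = (1.0 - P) * normal_pdf(t, mu1, sd1)
        f2 = P * normal_pdf(t, mu2, sd2)
        num += f2 * like * h         # numerator integrates against f_2
        den += (f1 + f2) * like * h  # denominator integrates against f = f_1 + f_2
    return num / den

# Hypothetical parameter values.
params = dict(P=0.3, mu1=0.0, sd1=1.0, mu2=2.0, sd2=1.0, a=1.0, b=1.0)
for k in (1, 3, 5):
    print(k, [round(prob_positive(j, k, **params), 3) for j in range(k + 1)])
```

A classification rule for a panel of size k then amounts to choosing the smallest j whose posterior probability exceeds the desired cutoff.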

Uebersax [26] considers a computational example of the varying 
panel latent trait agreement model, so we do not present one here. 

^Preliminary research suggests that it may be possible to eliminate 
the grid-search algorithm by the use of a "hybrid" estimation algorithm 
that combines the EM and Newton-Raphson methods.

FIXED RATING PANEL
Model 

For this model, each rater is taken to have a characteristic
threshold for making a positive rating. This threshold, however, is
assumed subject to random variation, described by a normal probability
distribution of values around a mean. The cumulative distribution
function of this probability distribution gives the probability of a
case at each level of θ equaling or exceeding the threshold of that
rater, and a positive rating being made. Thus, associated with each
rater j is a probability of positive rating function p_j(θ) having the
shape of a normal cumulative distribution function and centered at the
point on θ corresponding to that rater's mean threshold (Fig. 3.2).
Following a standard technique in item response theory, we assume
probability of positive rating functions to have the shapes of logistic
ogives, which closely approximate normal cumulative distribution
functions. The logistic function for each rater j depends on two
parameters, a_j and b_j, which correspond to the variability and mean
value, respectively, of that rater's threshold. Specifically, this
function is defined as p_j(θ) = 1/{1 + exp[-1.7a_j(θ - b_j)]}.

Fig. 3.2--Fixed panel latent trait agreement model: the
probability of positive rating functions of three hypothetical
raters are superimposed on the trait probability density
function, f(θ); the b values correspond to each rater's mean
threshold.

We define f(θ), f_1(θ), and f_2(θ) in the same way as for the varying
panel model.
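How closely the logistic ogive with the 1.7 scaling constant tracks a normal cumulative distribution function can be checked numerically. The short sketch below compares the two curves on a fine grid for a = 1 and b = 0. The function names are hypothetical.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_ogive(theta, a, b):
    """Probability of positive rating function 1/{1 + exp[-1.7a(theta - b)]}."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Largest absolute gap between the two curves on a grid from -6 to 6.
max_gap = max(
    abs(logistic_ogive(x / 100.0, 1.0, 0.0) - normal_cdf(x / 100.0))
    for x in range(-600, 601)
)
print(max_gap)
```

Because the logistic curve with parameters a and b approximates the normal ogive Φ[a(θ - b)], a larger value of a corresponds to a smaller implied standard deviation of the rater's threshold, roughly 1/a.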



Estimation 

As with the latent class fixed panel model, positive and negative
ratings by k raters may correspond to one of I = 2^k patterns. The
probability of the ith such pattern, u_i, occurring is

$$\Pr[V = u_i] = \int f(\theta) \prod_{j=1}^{k} p_j(\theta)^{u_{ij}} [1 - p_j(\theta)]^{1 - u_{ij}} \, d\theta, \tag{13}$$

where V is the vector of observed ratings and u_ij again corresponds to
the rating of the jth rater, coded 1 or 0.

The expected frequency of pattern i, e_i, is obtained by multiplying
results of Equation (13) times the number of cases rated, N. The
log-likelihood for the joint outcome of an agreement study is therefore

$$\log L = \sum_{i=1}^{I} f_i \log \Pr[V = u_i], \tag{14}$$

where f_i is defined as in the fixed panel latent class model.
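Equations (13) and (14) can be sketched for a small panel as follows. The code enumerates all 2^k rating patterns, integrates over θ by the midpoint rule, and evaluates the log-likelihood. It assumes prevalence-weighted normal densities for the two case types. The function names, the three-rater parameter values, and the model-implied frequencies are hypothetical.

```python
import math
from itertools import product

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def pattern_probs(P, mu1, sd1, mu2, sd2, a, b, lo=-10.0, hi=12.0, steps=1100):
    """Equation (13): {pattern: Pr[V = u_i]}; a and b are per-rater lists."""
    k = len(b)
    patterns = list(product((0, 1), repeat=k))
    probs = dict.fromkeys(patterns, 0.0)
    h = (hi - lo) / steps
    for s in range(steps):
        t = lo + (s + 0.5) * h
        f = (1.0 - P) * normal_pdf(t, mu1, sd1) + P * normal_pdf(t, mu2, sd2)
        p = [1.0 / (1.0 + math.exp(-1.7 * a[j] * (t - b[j]))) for j in range(k)]
        for u in patterns:
            like = 1.0
            for p_j, u_j in zip(p, u):
                like *= p_j if u_j else 1.0 - p_j
            probs[u] += f * like * h
    return probs

def log_likelihood(freqs, probs):
    """Equation (14): sum over patterns of f_i * log Pr[V = u_i]."""
    return sum(n * math.log(probs[u]) for u, n in freqs.items() if n > 0)

# Hypothetical three-rater panel with a common threshold variability.
probs = pattern_probs(P=0.3, mu1=0.0, sd1=1.0, mu2=2.0, sd2=1.0,
                      a=[1.0, 1.0, 1.0], b=[0.5, 1.0, 1.5])
N = 200
expected = {u: N * q for u, q in probs.items()}  # e_i = N * Pr[V = u_i]
print(log_likelihood(expected, probs))
```

In an actual analysis the observed pattern frequencies would take the place of `expected` in the call to `log_likelihood`, and the parameters would be varied to maximize the result.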

Maximum likelihood estimates of parameters are again those that
maximize log L and may be numerically obtained. If threshold
variability is assumed constant across raters (though this is not an
assumption one would make in all applications), the number of parameters
necessary to estimate may be reduced to k + 4: a mean threshold (b_j)
for each rater, within-rater threshold variability (a), the mean of
either positive or negative cases on the latent trait continuum (μ_2 or
μ_1), the standard deviation of either positive or negative cases (σ_2 or
σ_1), and the prevalence of positive cases (P). For unique estimates,
this number must be less than or equal to I - 1. Model fit is assessed
by comparison of observed and expected pattern frequencies with a
Pearson X² or likelihood-ratio chi-squared test with degrees of freedom
equal to I - 1 minus the number of estimated parameters.

Applications 

The sensitivity, specificity, and positive and negative predictive
validity of each rater's ratings can be estimated in the same way as
with the varying panel model, using the individual probability of
positive rating functions p_j(θ) in place of p(θ). These can also be
averaged across raters to provide mean accuracy indices.

Parameter estimates can again be used to classify a case as
positive or negative based on its ratings. For example, the joint
probability of a case being positive and receiving pattern u_i, which we
denote Pr[u_i, +], is obtained from Equation (13), using f_2(θ) in place
of f(θ). The probability of a positive case given u_i is then equal to
Pr[u_i, +] divided by Pr[V = u_i].
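The classification step just described can be sketched as follows: the joint probability Pr[u_i, +] is computed by integrating against f_2(θ), and is then divided by the pattern probability Pr[V = u_i]. The sketch assumes prevalence-weighted normal densities, and the function names and parameter values are hypothetical.

```python
import math
from itertools import product

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def joint_and_marginal(u, P, mu1, sd1, mu2, sd2, a, b,
                       lo=-10.0, hi=12.0, steps=1100):
    """Return (Pr[u, +], Pr[V = u]) for pattern u; a and b are per-rater lists."""
    h = (hi - lo) / steps
    joint = marginal = 0.0
    for s in range(steps):
        t = lo + (s + 0.5) * h
        like = 1.0
        for u_j, a_j, b_j in zip(u, a, b):
            p_j = 1.0 / (1.0 + math.exp(-1.7 * a_j * (t - b_j)))
            like *= p_j if u_j else 1.0 - p_j
        f1 = (1.0 - P) * normal_pdf(t, mu1, sd1)
        f2 = P * normal_pdf(t, mu2, sd2)
        joint += f2 * like * h            # Equation (13) with f_2 in place of f
        marginal += (f1 + f2) * like * h  # Equation (13) as written
    return joint, marginal

def posterior_positive(u, **params):
    joint, marginal = joint_and_marginal(u, **params)
    return joint / marginal

# Hypothetical three-rater panel.
params = dict(P=0.3, mu1=0.0, sd1=1.0, mu2=2.0, sd2=1.0,
              a=[1.0, 1.0, 1.0], b=[0.5, 1.0, 1.5])
for u in product((0, 1), repeat=3):
    print(u, round(posterior_positive(u, **params), 3))
```

Summing the joint probabilities over all patterns recovers the prevalence P, which provides a simple consistency check.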

Example 

We illustrate this model with the hypothetical data in Table 3.1.
These correspond to a study in which four diagnosticians rate 497 cases
for presence or absence of a disorder. To reduce the number of
estimated parameters, we assume σ_1 = σ_2 = σ = 1. We also assume a_1 =
a_2 = a_3 = a_4 = a, i.e., that threshold variability is constant across
raters. An arbitrary value of 0 is taken for one of the two case
distribution means. Thus, the parameters requiring estimation are the
remaining mean, P, a, b_1, b_2, b_3, and b_4. Initial estimates are
obtained by a grid-search algorithm. Using these as starting values, a
Newton-Raphson algorithm provides the maximum likelihood estimates shown
in Table 3.2.

Table 3.1

RESULTS OF HYPOTHETICAL DIAGNOSTIC AGREEMENT STUDY

Rating Pattern    Diagnostician    Observed Frequency    Expected Frequency

Table 3.2

PARAMETER ESTIMATES FOR FIXED PANEL
LATENT TRAIT AGREEMENT MODEL

Parameter    Estimate    Standard Error

Expected frequencies for each rating pattern given these estimates
are shown in Table 3.1. Comparison of these with the observed
frequencies results in values of 6.42 and 6.75 for the Pearson X² and
likelihood-ratio chi-squared statistics, respectively. With 15 - 7 = 8
df, these are both nonsignificant at the 0.5 level, indicating good fit.
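The significance statement can be checked directly. For an even number of degrees of freedom, the upper-tail probability of a chi-squared variate has a closed form, so no statistical library is needed. The function name below is hypothetical, and the closed form applies to even df only.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variate; even df only.

    Uses exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

df = 15 - 7  # 16 patterns minus 1, minus 7 estimated parameters
p_first = chi2_sf_even_df(6.42, df)
p_second = chi2_sf_even_df(6.75, df)
print(round(p_first, 2), round(p_second, 2))
```

Both tail probabilities come out above 0.5 (roughly 0.60 and 0.56), consistent with the statement that neither statistic is significant.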

From these parameter values, sensitivities for Raters 1 through 4 
are estimated as 0.92, 0.74, 0.51, and 0.41, and specificities as 0.52, 
0.81, 0.92, and 0.95. Estimated mean sensitivity and specificity across 
raters are 0.65 and 0.80, respectively. 

DIRECTIONS FOR FUTURE RESEARCH 

We have considered two latent distributions, one corresponding to 
negative and one to positive cases. However, it is possible to 
generalize this approach. For example, positive cases may consist of 
two subtypes, each normally distributed on the latent trait continuum. 
In some applications it might make sense to consider cases as following 
a single distribution [26]. There is also no need to require normal 
distributions; different parameterized distributional forms may also be
considered.

We believe that significant improvements are possible for the 
estimation of these models. For example, marginal maximum likelihood 
estimation [53] may prove useful. 

The question naturally arises of whether latent class or latent
trait agreement models would be better for a given set of data. 
Ideally, both approaches could be used and a selection made on the basis 
of which provides better fit. However, although formal statistical 
methods for comparing the fit of nested models exist, there is no 
generally accepted method for comparing qualitatively different models; 
research in this area, though, is proceeding (see, for example, Ref. 

Because the estimation procedures and software are better developed
for the latent class agreement model, we would generally recommend that
investigators pursue that approach first.

Appendix A

SAMPLE INPUT FILE FOR FIXED PANEL LATENT 
CLASS AGREEMENT MODEL 



The following shows an input file for estimating model of the
Park et al. [45] treatment appropriateness rating data using the MLLSA
latent class analysis program [36]: